Abstract

Degeneracy of the genetic code is a biological way to minimize the
effects of undesirable mutation changes. Degeneration has a natural
description on the 5-adic space of 64 codons
$C_5(64) = \{ n_0 + n_1 5 + n_2 5^2 : n_i = 1, 2, 3, 4 \}$, where $n_i$
are digits related to nucleotides as follows: C = 1, A = 2, T = U = 3,
G = 4. The smallest 5-adic distance between codons joins them into 16
quadruplets, which under 2-adic distance decay into 32 doublets.
p-Adically close codons are assigned to one of 20 amino acids, which
are building blocks of proteins, or code termination of protein
synthesis. We show that genetic code multiplets are made of the p-adic
nearest codons.



1 Introduction 

Genetic information in living systems is contained in the
deoxyribonucleic acid (DNA) sequence. DNA macromolecules are composed
of two polynucleotide chains with a double-helical structure. The
building blocks of the genetic information are four nucleotides:
adenine (A), guanine (G), cytosine (C) and thymine (T). A and G are
purines, while C and T are pyrimidines. Nucleotides are arranged along
the double helix through base pairs A-T and C-G. The DNA is packaged in
chromosomes, which are localized in the nucleus of eukaryotic cells.
One of the basic processes within DNA is its replication. The passage
of DNA gene information to proteins, called gene expression, is carried
out by messenger ribonucleic acids (mRNA), which are usually single
polynucleotide chains. The mRNA are synthesized in the first part of
this process, known as transcription, when nucleotides A, G, C, T from
DNA are respectively transcribed into their complements U, C, G, A of
mRNA, where T is replaced by U (U is uracil). The next step is
translation, when the information coded by codons in the mRNA is
translated into proteins. In this process two other RNAs are involved:
transfer RNA (tRNA) and ribosomal RNA (rRNA). Codons are ordered
sequences of three nucleotides taken from A, G, C, U. Protein synthesis
in all eukaryotic cells takes place in the ribosomes of the cytoplasm.

The genetic code relates the information of the sequence of codons
in mRNA to the sequence of amino acids in a protein. Although there
are about a dozen codes (see, e.g. [1]), the most important are two of
them: the eukaryotic code and the vertebrate mitochondrial code. In
the sequel we shall mainly consider the vertebrate mitochondrial code,
because it looks the simplest one and the others can be regarded as its
modifications. There are 4 x 4 x 4 = 64 codons. However (in the
vertebrate mitochondrial code), 60 of them are distributed over the 20
different amino acids and 4 are stop codons, which serve as termination
signals. According to experimental observations, two amino acids are
coded by six codons, six amino acids by four codons, and twelve amino
acids by two codons. This property, that more than one codon
corresponds to an amino acid, is known as genetic code degeneracy.
This degeneracy is a very important property of the genetic
code and gives an efficient way to minimize effects of the undesir- 
able mutation changes. Since there is a huge number (about 10^°) of 
all possible assignments between codons and amino acids, and only
a very small number (about a dozen) of them is represented in living
cells, it has been a persistent theoretical challenge to find an
appropriate model explaining contemporary genetic codes. There is still
no generally accepted explanation of the genetic code. For detailed
and comprehensive information on molecular biology aspects of DNA,
RNA and the genetic code one can see Ref. [2]. It is worth mentioning
that the human genome, which contains all genetic information of Homo
sapiens, is composed of about three billion DNA base pairs and
contains more than 20,000 genes.

Modeling of DNA, RNA and the genetic code is a challenge as well
as an opportunity for modern mathematical physics. An interesting
model based on the quantum algebra $\mathcal{U}_q(sl(2) \oplus sl(2))$
in the $q \to 0$ limit was proposed as a symmetry algebra for the
genetic code (see pQ and references therein). In a sense this approach
mimics the quark model of baryons. To describe the correspondence
between codons and amino acids, an operator was constructed which acts
on the space of codons and whose eigenvalues are related to amino
acids. Despite some successes of this approach, there is a problem with
the rather large number of parameters in the operator. There are also
papers [3] starting with a 64-dimensional irreducible representation of
a Lie (super)algebra and trying to connect the multiplicity of codons
with irreducible representations of subalgebras arising in a chain of
symmetry breaking. Although interesting as an attempt to describe the
evolution of the genetic code, these Lie algebra approaches did not
succeed in obtaining its modern form. For a very brief review of these
and some other theoretical approaches to the genetic code one can see
Ref. [1].

Recently we introduced a p-adic approach to DNA and RNA sequences
and the genetic code [1]. Let us mention that p-adic models in
mathematical physics have been actively considered since 1987 (see [5],
[6] for early reviews and [7], [8] for some recent reviews). It is worth
noting that p-adic models with pseudodifferential operators have been
successfully applied to interbasin kinetics of proteins [9]. Some
p-adic aspects of cognitive, psychological and social phenomena have
also been considered [10]. The present status of application of p-adic
numbers in physics and related branches of science is reflected in the
proceedings of the 2nd International Conference on p-Adic Mathematical
Physics [11]. The main goal of this paper is to present the p-adic root
of the genetic code and, in particular, of its degeneracy.
2 p-Adic space of codons



An elementary introduction to p-adic numbers can be found in the 
book [12]. However, for our purposes we will use here only a bit of p- 
adics, mainly a finite set of integers and ultrametric distance between 
them. 

Let us introduce the set of natural numbers

$$C_5(64) = \{ n_0 + n_1 5 + n_2 5^2 : n_i = 1, 2, 3, 4 \} , \qquad (1)$$

where $n_i$ are digits related to nucleotides by the following
assignment: C = 1, A = 2, T = U = 3, G = 4. This is an expansion to the
base 5. Note that 5 is a prime number and that the set $C_5(64)$
contains 64 numbers between 31 and 124 (in the usual base 10). In the
sequel we shall denote elements of $C_5(64)$ by their digits to the
base 5 in the following way: $n_0 + n_1 5 + n_2 5^2 \equiv n_0 n_1 n_2$.
Note that here the ordering of digits is the same as in the expansion
(1), i.e. this ordering is opposite to the usual one. There is now an
evident one-to-one correspondence between codons in the letter
representation XYZ and the number representation $n_0 n_1 n_2$.
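
As a minimal sketch of the construction above, the following Python snippet builds the 64 integers of $C_5(64)$ from the digit assignment C = 1, A = 2, U = 3, G = 4. The names `DIGIT`, `codon_to_int`, `codon_to_digits` and `codon_space` are illustrative choices, not notation from the paper.

```python
from itertools import product

# Digit assignment stated in the text: C = 1, A = 2, T = U = 3, G = 4.
DIGIT = {"C": 1, "A": 2, "U": 3, "G": 4}

def codon_to_int(codon):
    """Map a codon XYZ to n0 + n1*5 + n2*5**2 (the first nucleotide is n0)."""
    n0, n1, n2 = (DIGIT[x] for x in codon)
    return n0 + n1 * 5 + n2 * 25

def codon_to_digits(codon):
    """Digit string n0 n1 n2 in the paper's ordering, e.g. 'CAG' -> '124'."""
    return "".join(str(DIGIT[x]) for x in codon)

# All 4 x 4 x 4 codons, keyed by their letter representation.
codon_space = {"".join(c): codon_to_int(c) for c in product("CAUG", repeat=3)}

print(len(set(codon_space.values())))                        # 64
print(min(codon_space.values()), max(codon_space.values()))  # 31 124
```

The smallest element corresponds to CCC (digits 111) and the largest to GGG (digits 444), in agreement with the range 31 to 124 quoted above.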

In addition to arithmetic operations it is often important to know
also a distance between numbers. Distance can be defined by a norm.
On the set $\mathbb{Z}$ of integers there are two kinds of nontrivial
norm: the usual absolute value $|\cdot|_\infty$ and the p-adic absolute
value $|\cdot|_p$, where p is any prime number. The usual absolute
value is well known from elementary courses of mathematics and the
corresponding distance between two numbers x and y is
$d_\infty(x, y) = |x - y|_\infty$.

The p-adic absolute value is related to the divisibility of integers by
prime numbers, and p-adic distance can be understood as a measure of
this divisibility for the difference of two numbers (the more divisible,
the shorter). By definition, the p-adic norm of an integer
$m \in \mathbb{Z}$ is $|m|_p = p^{-k}$, where $k \in \mathbb{N} \cup \{0\}$
is the degree of divisibility of m by the prime p (i.e. $m = p^k m'$,
$p \nmid m'$), and $|0|_p = 0$. This norm is a mapping from $\mathbb{Z}$
into the non-negative real numbers and has the following properties:

(i) $|x|_p \geq 0$, $|x|_p = 0$ if and only if x = 0,

(ii) $|x y|_p = |x|_p |y|_p$,

(iii) $|x + y|_p \leq \max\{|x|_p, |y|_p\} \leq |x|_p + |y|_p$ for all $x, y \in \mathbb{Z}$.

Because of the strong triangle inequality
$|x + y|_p \leq \max\{|x|_p, |y|_p\}$, the p-adic absolute value is a
non-Archimedean (ultrametric) norm. One can easily conclude that
$0 \leq |m|_p \leq 1$.
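
The definition of the p-adic norm on integers can be sketched directly in Python. The function name `p_adic_norm` is an illustrative choice, and exact rationals from the standard `fractions` module are used so that values such as 1/25 are not rounded.

```python
from fractions import Fraction

def p_adic_norm(m, p):
    """Return |m|_p = p**(-k), where p**k is the largest power of p dividing m."""
    if m == 0:
        return Fraction(0)          # |0|_p = 0 by definition
    m = abs(m)
    k = 0
    while m % p == 0:               # count how many times p divides m
        m //= p
        k += 1
    return Fraction(1, p ** k)

print(p_adic_norm(50, 5))   # 1/25, since 50 = 2 * 5**2
print(p_adic_norm(50, 2))   # 1/2
print(p_adic_norm(7, 5))    # 1
```

The same integer is small in the 5-adic sense and comparatively large in the 2-adic sense, which is the feature exploited later when 5-adic and 2-adic distances are combined.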

The p-adic distance between two integers x and y is

$$d_p(x, y) = |x - y|_p . \qquad (2)$$

Since the p-adic absolute value is ultrametric, the p-adic distance (2)
is also ultrametric, i.e. it satisfies

$$d_p(x, y) \leq \max\{ d_p(x, z), d_p(z, y) \} \leq d_p(x, z) + d_p(z, y) , \qquad (3)$$

where x, y and z are any three integers.
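
The strong triangle inequality (3) can be checked numerically on a small sample of integers. The sketch below is only an illustration: the helper names `p_adic_distance` and `is_ultrametric`, and the sample range, are assumptions made for this example and do not come from the paper.

```python
from fractions import Fraction
from itertools import product

def p_adic_distance(x, y, p):
    """d_p(x, y) = |x - y|_p for integers x, y and a prime p."""
    m = abs(x - y)
    if m == 0:
        return Fraction(0)
    k = 0
    while m % p == 0:
        m //= p
        k += 1
    return Fraction(1, p ** k)

def is_ultrametric(points, p):
    """Check d(x, y) <= max(d(x, z), d(z, y)) for every triple of points."""
    return all(
        p_adic_distance(x, y, p)
        <= max(p_adic_distance(x, z, p), p_adic_distance(z, y, p))
        for x, y, z in product(points, repeat=3)
    )

sample = range(-12, 13)  # arbitrary test range chosen for this sketch
print(all(is_ultrametric(sample, p) for p in (2, 3, 5)))  # True
```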

The above introduced set $C_5(64)$ endowed with the p-adic distance we
shall call the p-adic space of codons. The 5-adic distance between two
codons $a, b \in C_5(64)$ is

$$d_5(a, b) = |a_0 + a_1 5 + a_2 5^2 - b_0 - b_1 5 - b_2 5^2|_5 . \qquad (4)$$

When $a \neq b$ then $d_5(a, b)$ may have three different values:
(i) $d_5(a, b) = 1$ if $a_0 \neq b_0$, (ii) $d_5(a, b) = 1/5$ if
$a_0 = b_0$ and $a_1 \neq b_1$, and (iii) $d_5(a, b) = 1/5^2$ if
$a_0 = b_0$, $a_1 = b_1$ and $a_2 \neq b_2$. We see that the maximum
5-adic distance between codons is 1 and it is equal to the maximum
p-adic distance on $\mathbb{Z}$. Let us also note that this distance
depends only on the first two nucleotides in the codons. Use of the
5-adic distance between codons is a natural way to describe
information similarity between them.
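
The three cases (i)-(iii) can be verified exhaustively over all pairs of codons. In the sketch below the helper names (`codon_to_int`, `d5`, `first_difference`) are illustrative; the check relies only on equation (4) and the digit assignment C = 1, A = 2, U = 3, G = 4.

```python
from fractions import Fraction
from itertools import combinations, product

DIGIT = {"C": 1, "A": 2, "U": 3, "G": 4}

def codon_to_int(codon):
    n0, n1, n2 = (DIGIT[x] for x in codon)
    return n0 + 5 * n1 + 25 * n2

def d5(a, b):
    """5-adic distance between two codons given as strings such as 'CAG'."""
    m = abs(codon_to_int(a) - codon_to_int(b))
    if m == 0:
        return Fraction(0)
    k = 0
    while m % 5 == 0:
        m //= 5
        k += 1
    return Fraction(1, 5 ** k)

def first_difference(a, b):
    """Index (0, 1 or 2) of the first position where two codons differ."""
    return next(i for i in range(3) if a[i] != b[i])

codons = ["".join(c) for c in product("CAUG", repeat=3)]

# The distance is 5**(-i), where i is the first position at which the codons differ.
for a, b in combinations(codons, 2):
    assert d5(a, b) == Fraction(1, 5 ** first_difference(a, b))

print(d5("CCC", "ACC"), d5("CCC", "CAC"), d5("CCC", "CCG"))  # 1 1/5 1/25
```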

In the case of the standard distance
$d_\infty(a, b) = |a_0 + a_1 5 + a_2 5^2 - b_0 - b_1 5 - b_2 5^2|_\infty$,
the third nucleotides $a_2$ and $b_2$ play a more important role than
those at the second place (i.e. $a_1$ and $b_1$), and the nucleotides
$a_0$ and $b_0$ are of the smallest importance.

3 p-Adic genetic code

Living cells are very complex systems composed mainly of proteins,
which play various roles. These proteins are long linear chains made
of only 20 amino acids, which are the same for the whole living world
on Earth. Different sequences of amino acids form different proteins.

An intensive study of the connection between the ordering of
nucleotides in DNA (and RNA) and the ordering of amino acids in
proteins led to the experimental discovery of the genetic code in the
mid-1960s. The genetic code is understood as a dictionary for
translation of information from DNA (through RNA) to the production of
proteins from amino acids. The information on amino acids is contained
in codons. To the sequence of codons in the RNA there corresponds a
quite definite sequence of amino acids in a protein, and this sequence
of amino acids determines the primary structure of the protein.

However, there is no simple theoretical understanding of genetic
coding. In particular, it is not clear why the genetic code exists just
in the known form and not in one of many other possible forms. What
principle (or principles) was used in the establishment of a basic
(mitochondrial) code? What properties of codons connect them into
definite multiplets which code the same amino acid or termination
signal? These are only some of many questions whose answers should lead
us to an appropriate theoretical model of the genetic code.



111 CCC Pro     211 ACC Thr     311 UCC Ser     411 GCC Ala
112 CCA Pro     212 ACA Thr     312 UCA Ser     412 GCA Ala
113 CCU Pro     213 ACU Thr     313 UCU Ser     413 GCU Ala
114 CCG Pro     214 ACG Thr     314 UCG Ser     414 GCG Ala

121 CAC His     221 AAC Asn     321 UAC Tyr     421 GAC Asp
122 CAA Gln     222 AAA Lys     322 UAA Ter     422 GAA Glu
123 CAU His     223 AAU Asn     323 UAU Tyr     423 GAU Asp
124 CAG Gln     224 AAG Lys     324 UAG Ter     424 GAG Glu

131 CUC Leu     231 AUC Ile     331 UUC Phe     431 GUC Val
132 CUA Leu     232 AUA Met     332 UUA Leu     432 GUA Val
133 CUU Leu     233 AUU Ile     333 UUU Phe     433 GUU Val
134 CUG Leu     234 AUG Met     334 UUG Leu     434 GUG Val

141 CGC Arg     241 AGC Ser     341 UGC Cys     441 GGC Gly
142 CGA Arg     242 AGA Ter     342 UGA Trp     442 GGA Gly
143 CGU Arg     243 AGU Ser     343 UGU Cys     443 GGU Gly
144 CGG Arg     244 AGG Ter     344 UGG Trp     444 GGG Gly

Table: The vertebrate mitochondrial code
Let us now look at the experimental Table of the vertebrate
mitochondrial code and compare it with the above introduced $C_5(64)$
codon space. To this end, codons are simultaneously denoted by three
digits and standard capital letters (recall: C = 1, A = 2, U = 3,
G = 4). The corresponding amino acids are presented in the usual
three-letter form.
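
The Table can be encoded compactly and used to confirm the degeneracy pattern quoted in the Introduction (two amino acids with six codons, six with four, twelve with two, and four stop codons). The layout of the `TABLE` dictionary and all variable names below are illustrative choices for this sketch; the codon assignments themselves are those of the Table.

```python
from collections import Counter

# Keyed by the first two nucleotides; the four entries of each row follow
# the third nucleotide in the order C, A, U, G (digits 1, 2, 3, 4).
TABLE = {
    "CC": "Pro Pro Pro Pro", "AC": "Thr Thr Thr Thr",
    "UC": "Ser Ser Ser Ser", "GC": "Ala Ala Ala Ala",
    "CA": "His Gln His Gln", "AA": "Asn Lys Asn Lys",
    "UA": "Tyr Ter Tyr Ter", "GA": "Asp Glu Asp Glu",
    "CU": "Leu Leu Leu Leu", "AU": "Ile Met Ile Met",
    "UU": "Phe Leu Phe Leu", "GU": "Val Val Val Val",
    "CG": "Arg Arg Arg Arg", "AG": "Ser Ter Ser Ter",
    "UG": "Cys Trp Cys Trp", "GG": "Gly Gly Gly Gly",
}

# Expand to a codon -> amino acid (or 'Ter') dictionary with 64 entries.
code = {xy + z: aa
        for xy, row in TABLE.items()
        for z, aa in zip("CAUG", row.split())}

codons_per_aa = Counter(code.values())
stop_codons = codons_per_aa.pop("Ter")           # termination is not an amino acid
by_size = Counter(codons_per_aa.values())        # multiplet size -> number of amino acids

print(len(code), len(codons_per_aa), stop_codons)  # 64 20 4
print(sorted(by_size.items()))                     # [(2, 12), (4, 6), (6, 2)]
```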

First of all let us note that our Table is constructed according to
the gradual change of digits and, as a consequence, there is a
different spatial distribution of amino acids compared to the standard
(Watson-Crick) table. Any of these tables can be regarded as a big
rectangle divided into 16 equal smaller rectangles: 8 of them are
quadruplets which correspond one-to-one to 8 amino acids, and the other
8 rectangles are divided into 16 doublets coding 14 amino acids and the
termination (stop) signal (by two doublets at different places). Note
that 2 of the 16 doublets code 2 amino acids (Ser and Leu) which are
already coded by 2 quadruplets, thus the amino acids Serine and Leucine
are coded by 6 codons. In our Table quadruplets and doublets together
form a figure which is symmetric with respect to the mid vertical
line, i.e. it is invariant under the interchange $1 \leftrightarrow 4$
and $2 \leftrightarrow 3$ of the first digits in codons. Recall that
DNA is symmetric under simultaneous interchange of complementary
nucleotides in its strands. In other words, DNA is invariant under the
nucleotide interchange $1 \leftrightarrow 4$ and $2 \leftrightarrow 3$
between strands. All doublets in the Table form a nice figure which
looks like the letter T.

Now we can look at the Table as a representation of the $C_5(64)$
codon space. Namely, we observe that there are 16 quadruplets such
that each of them has the same first two digits. Hence the 5-adic
distance between any two different codons inside a quadruplet is

$$d_5(a, b) = |a_0 + a_1 5 + a_2 5^2 - a_0 - a_1 5 - b_2 5^2|_5 = |(a_2 - b_2) 5^2|_5 = 5^{-2} , \qquad (5)$$

because $a_0 = b_0$, $a_1 = b_1$ and $|a_2 - b_2|_5 = 1$.
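
The grouping into quadruplets can be reproduced by collecting, for every codon, all codons within 5-adic distance $5^{-2}$ of it. This is a sketch under the digit assignment of the text; `p_adic_distance`, `codons` and `balls` are names introduced here for illustration only.

```python
from fractions import Fraction
from itertools import product

DIGIT = {"C": 1, "A": 2, "U": 3, "G": 4}

# codon string -> integer n0 + n1*5 + n2*25
codons = {"".join(c): sum(DIGIT[x] * 5 ** i for i, x in enumerate(c))
          for c in product("CAUG", repeat=3)}

def p_adic_distance(x, y, p):
    m = abs(x - y)
    if m == 0:
        return Fraction(0)
    k = 0
    while m % p == 0:
        m //= p
        k += 1
    return Fraction(1, p ** k)

# Closed 5-adic ball of radius 1/25 around each codon; by ultrametricity
# two such balls are either identical or disjoint.
balls = {
    frozenset(b for b in codons
              if p_adic_distance(codons[a], codons[b], 5) <= Fraction(1, 25))
    for a in codons
}

print(len(balls), {len(q) for q in balls})  # 16 {4}
# Every ball consists of codons sharing the first two nucleotides.
print(all(len({c[:2] for c in q}) == 1 for q in balls))  # True
```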

Since codons are composed of three nucleotides, each of which is
either a purine or a pyrimidine, it is natural to try to quantify
similarity inside purines and pyrimidines, as well as distinction
between elements from these two groups of nucleotides. Fortunately
there is a tool, which is again related to the p-adics, and now it is
the 2-adic distance. One can easily see that the 2-adic distance
between the pyrimidines C and U is 1/2, as is the distance between the
purines A and G. However, the 2-adic distance between C and A or G, as
well as the distance between U and A or G, is 1 (i.e. maximum).

With respect to the 2-adic distance, the above quadruplets may
be regarded as composed of two doublets: $a = a_0 a_1 1$ and
$b = a_0 a_1 3$ make the first doublet, and $c = a_0 a_1 2$ and
$d = a_0 a_1 4$ form the second one. The 2-adic distance between codons
within each of these doublets is 1/2, i.e.

$$d_2(a, b) = |(3 - 1) 5^2|_2 = \frac{1}{2} , \quad d_2(c, d) = |(4 - 2) 5^2|_2 = \frac{1}{2} , \qquad (6)$$

because 3 - 1 = 4 - 2 = 2.
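
The splitting of each quadruplet into two doublets can be sketched by selecting pairs of codons that are simultaneously 5-adically closest (distance 1/25) and 2-adically at distance 1/2. The helper names below are illustrative and not taken from the paper.

```python
from fractions import Fraction
from itertools import combinations, product

DIGIT = {"C": 1, "A": 2, "U": 3, "G": 4}
codons = {"".join(c): sum(DIGIT[x] * 5 ** i for i, x in enumerate(c))
          for c in product("CAUG", repeat=3)}

def dist(a, b, p):
    """p-adic distance between two codons given as strings."""
    m = abs(codons[a] - codons[b])
    if m == 0:
        return Fraction(0)
    k = 0
    while m % p == 0:
        m //= p
        k += 1
    return Fraction(1, p ** k)

# A doublet: two codons of the same quadruplet at 2-adic distance 1/2.
doublets = {
    frozenset((a, b))
    for a, b in combinations(codons, 2)
    if dist(a, b, 5) == Fraction(1, 25) and dist(a, b, 2) == Fraction(1, 2)
}

print(len(doublets))  # 32
# Third nucleotides paired inside doublets: two pyrimidines or two purines.
print(sorted({"".join(sorted(c[2] for c in d)) for d in doublets}))  # ['AG', 'CU']
```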

One can now look at the Table as a system of 32 doublets. Thus 64
codons are clustered in a very regular way into 32 doublets. Each of
21 subjects (20 amino acids and 1 termination operation) is coded by
one, two or three doublets. In fact, there are two, six and twelve
amino acids coded by three, two and one doublets, respectively. The
remaining two doublets code the termination signal.
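
The doublet counts stated above can be checked against the Table. In this sketch the doublets are formed as in equation (6) (third nucleotide pair C, U or A, G within a fixed first-two-nucleotide block); the dictionary layout and names are assumptions made for illustration.

```python
from collections import Counter

# Vertebrate mitochondrial code from the Table; entries follow the third
# nucleotide in the order C, A, U, G.
TABLE = {
    "CC": "Pro Pro Pro Pro", "AC": "Thr Thr Thr Thr",
    "UC": "Ser Ser Ser Ser", "GC": "Ala Ala Ala Ala",
    "CA": "His Gln His Gln", "AA": "Asn Lys Asn Lys",
    "UA": "Tyr Ter Tyr Ter", "GA": "Asp Glu Asp Glu",
    "CU": "Leu Leu Leu Leu", "AU": "Ile Met Ile Met",
    "UU": "Phe Leu Phe Leu", "GU": "Val Val Val Val",
    "CG": "Arg Arg Arg Arg", "AG": "Ser Ter Ser Ter",
    "UG": "Cys Trp Cys Trp", "GG": "Gly Gly Gly Gly",
}
code = {xy + z: aa
        for xy, row in TABLE.items()
        for z, aa in zip("CAUG", row.split())}

# Each block of four codons splits into the doublets (..C, ..U) and (..A, ..G).
doublets = [(xy + pair[0], xy + pair[1]) for xy in TABLE for pair in ("CU", "AG")]

# In this code both codons of every doublet have the same meaning.
single_meaning = all(code[a] == code[b] for a, b in doublets)

doublets_per_subject = Counter(code[a] for a, _ in doublets)
ter_doublets = doublets_per_subject.pop("Ter")
summary = sorted(Counter(doublets_per_subject.values()).items())

print(len(doublets), single_meaning, ter_doublets)  # 32 True 2
print(summary)  # [(1, 12), (2, 6), (3, 2)]
```

The last line reads as: twelve amino acids coded by one doublet, six by two doublets, and two (Leucine and Serine) by three doublets.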

To have a more complete picture of the genetic code it is useful
to consider possible distances between codons from different
quadruplets as well as from different doublets. Also, we introduce a
distance between quadruplets or between doublets, especially when
distances between their codons have the same value. Thus the 5-adic
distance between a quadruplet and the quadruplets in the same column is
1/5, while such distance toward all other quadruplets is 1. The 5-adic
distance between doublets coincides with the distance between
quadruplets, and this distance is 1/25 when the doublets are inside the
same quadruplet.

The 2-adic distance between codons, doublets and quadruplets is
more complex. There are three basic cases: (1) codons differ only in
one digit, (2) codons differ in two digits, and (3) codons differ in all
three digits. In the first case, the 2-adic distance can be 1/2 or 1,
depending on whether the difference between the digits is 2 or not,
respectively.

Let us now look at 2-adic distances between doublets coding Leucine
and also between doublets coding Serine. These are the two cases of
amino acids coded by three doublets. The doublet consisting of codons
332 and 334 should be compared with the doublet of codons 132 and 134.
The largest 2-adic distance between them is 1/2. We again obtain the
maximum distance 1/2 for Serine when we compare the doublets (311, 313)
and (241, 243).
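
The two comparisons above can be reproduced numerically. In this sketch codons are given by their digit strings $n_0 n_1 n_2$ as in the text; the function names `two_adic_norm`, `digits_to_int` and `max_d2` are illustrative.

```python
from fractions import Fraction
from itertools import product

def two_adic_norm(m):
    """|m|_2 = 2**(-k), with 2**k the largest power of 2 dividing m."""
    m = abs(m)
    if m == 0:
        return Fraction(0)
    k = 0
    while m % 2 == 0:
        m //= 2
        k += 1
    return Fraction(1, 2 ** k)

def digits_to_int(s):
    """'332' -> 3 + 3*5 + 2*25 (digits are ordered n0 n1 n2)."""
    return sum(int(ch) * 5 ** i for i, ch in enumerate(s))

def max_d2(doublet_a, doublet_b):
    """Largest 2-adic distance between a codon of one doublet and a codon of the other."""
    return max(two_adic_norm(digits_to_int(x) - digits_to_int(y))
               for x, y in product(doublet_a, doublet_b))

print(max_d2(("332", "334"), ("132", "134")))  # 1/2 (Leucine doublets)
print(max_d2(("311", "313"), ("241", "243")))  # 1/2 (Serine doublets)
```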

Other known codes may be regarded as modifications of the vertebrate
mitochondrial code (inside the five quadruplets of the T-like region
and the quadruplet coding Leucine). The modification means that some
codons change their meaning and code either other amino acids or the
termination signal. So, in the universal (standard, canonical) code
there are the following changes: (i) 232 AUA: Met $\to$ Ile, (ii) 242
AGA and 244 AGG: Ter $\to$ Arg, (iii) 342 UGA: Trp $\to$ Ter.

4 Discussion and concluding remarks 

We have chosen p = 5 as the base in the expansion of an element of
the $C_5(64)$ space of codons, because 5 is the smallest prime number
whose digits can represent the four nucleotides (A, T, G, C) in DNA,
or (A, U, G, C) in RNA, as four different digits. At first glance,
because there are four nucleotides, one might think that a 4-adic
expansion, which has just four digits, would be more appropriate.
However, note that 4 is a composite integer and that such an expansion
is not suitable, since the corresponding $|\cdot|_4$ absolute value is
not a norm but a pseudonorm, and this causes a problem with uniqueness
of the distance between two points. To illustrate this problem let us
consider, for instance, the distance between the numbers 4 and 0. Then
we have $d_4(0, 4) = |4|_4 = \frac{1}{4}$, but on the other hand
$d_4(0, 4) = |2|_4 |2|_4 = 1$.
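
The failure of multiplicativity for the composite base 4 can be seen in a few lines. The sketch below applies the same divisibility-based definition to an arbitrary base n; the names `n_adic_value` and `is_multiplicative`, and the search bound, are assumptions made for this illustration.

```python
from fractions import Fraction

def n_adic_value(m, n):
    """n**(-k), with n**k the largest power of n dividing m (n need not be prime)."""
    m = abs(m)
    if m == 0:
        return Fraction(0)
    k = 0
    while m % n == 0:
        m //= n
        k += 1
    return Fraction(1, n ** k)

# The example from the text: 4 = 2 * 2.
print(n_adic_value(4, 4))                        # 1/4
print(n_adic_value(2, 4) * n_adic_value(2, 4))   # 1

def is_multiplicative(n, bound=30):
    """Check |x*y|_n == |x|_n * |y|_n for all 1 <= x, y < bound."""
    return all(n_adic_value(x * y, n) == n_adic_value(x, n) * n_adic_value(y, n)
               for x in range(1, bound) for y in range(1, bound))

print(is_multiplicative(5), is_multiplicative(4))  # True False
```

For the prime 5 the product rule holds on the tested range, while for 4 it already fails at x = y = 2, which is the ambiguity described above.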

Recall that there are generally 5 digits (0, 1, 2, 3, 4) in the
representation of 5-adic numbers. In this approach, we omitted the
digit 0 to represent a nucleotide, because its consistent meaning can
be only the absence of any nucleotide.

Let us note that there are in general 24 possibilities to connect four 
digits with four nucleotides. However, we find that the above choice 
seems to be the most appropriate. 

An essential property of the $C_5(64)$ space of codons is the
ultrametric behavior of distances between its elements, which radically
differs from usual distances. One can easily observe that quadruplets
and doublets of codons in the vertebrate mitochondrial code have a
natural explanation within 5-adic and 2-adic closeness. It follows that
degeneracy of the genetic code in the form of doublets, quadruplets and
sextuplets is a direct consequence of p-adic ultrametricity between
codons.

There is an important aspect of genetic coding related to particular 
connections between codons and amino acids. Namely, which amino 
acid corresponds to which multiplet of codons? An answer should 
be related to connections between stereochemical properties of codons 
and amino acids. 

Let us also note a recent paper [13], where an ultrametric approach
to the genetic code is considered on a dyadic plane.